13 research outputs found

    Learning speech embeddings for speaker adaptation and speech understanding

    In recent years, deep neural network models have gained popularity as a modeling approach for many speech processing tasks, including automatic speech recognition (ASR) and spoken language understanding (SLU). This dissertation has two main goals. The first is to propose modeling approaches that learn speaker embeddings for speaker adaptation or learn semantic speech embeddings. The second is to introduce training objectives that achieve fairness for the ASR and SLU problems. For speaker adaptation, we introduce an auxiliary network to an ASR model and learn to simultaneously detect speaker changes and adapt to the speaker in an unsupervised way. We show that this joint model leads to lower error rates than a two-step approach in which the signal is first segmented into single-speaker regions and then fed into an adaptation model. We then reformulate the speaker adaptation problem from a counterfactual fairness point of view and introduce objective functions that match the ASR performance of individuals in the dataset to that of their counterfactual counterparts. We show that we can achieve a lower error rate in an ASR system while reducing the performance disparity between protected groups. In the second half of the dissertation, we focus on SLU and tackle two problems associated with SLU datasets. The first is the lack of large speech corpora. To handle this issue, we propose using available non-parallel text data so that the information in text can guide learning of the speech embeddings. We show that this technique increases intent classification accuracy compared to a speech-only system. The second is the label imbalance problem, which is also related to fairness, since a model trained on skewed data usually produces biased results.
    To achieve fair SLU, we propose maximizing the F-measure instead of the conventional cross-entropy minimization and show that it is possible to increase the number of classes with nonzero recall. In the last two chapters, we discuss the impact of these projects from both technical and social perspectives, propose directions for future research, and summarize the findings.
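The F-measure is not differentiable as usually defined, so maximizing it requires a surrogate. The following is a minimal plain-Python sketch of one common soft-count formulation (not the dissertation's actual implementation, whose details the abstract does not give):

```python
# Hypothetical sketch of a differentiable (soft) F-measure objective, offered
# as an alternative to cross-entropy under class imbalance. Plain Python for
# clarity; real training code would use tensor operations.

def soft_f1_loss(probs, labels):
    """Soft F1 loss for one class: 1 - F1 computed from probabilities.

    probs  -- predicted probabilities for the positive class, in [0, 1]
    labels -- gold labels, 0 or 1
    """
    tp = sum(p * y for p, y in zip(probs, labels))        # soft true positives
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))  # soft false positives
    fn = sum((1 - p) * y for p, y in zip(probs, labels))  # soft false negatives
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-8)               # epsilon avoids 0/0
    return 1.0 - f1
```

Because false negatives enter the denominator directly, a model that ignores a minority class is penalized even when its overall accuracy is high, which is why such surrogates can raise the number of classes with nonzero recall.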

    Biased Self-supervised learning for ASR

    Self-supervised learning via masked prediction pre-training (MPPT) has shown impressive performance on a range of speech-processing tasks. This paper proposes a method to bias self-supervised learning towards a specific task. The core idea is to slightly finetune the model that is used to obtain the target sequence. This leads to better performance and a substantial increase in training speed. Furthermore, this paper proposes a variant of MPPT that allows low-footprint streaming models to be trained effectively by computing the MPPT loss on masked and unmasked frames. These approaches are evaluated for automatic speech recognition on the Librispeech corpus, where 100 hours of data served as the labelled data and 860 hours as the unlabelled data. The biased training outperforms the unbiased training by 15.5% after 250k updates and 23.8% after 100k updates on test-other. For the streaming models, the pre-training approach yields a reduction in word error rate of 44.1%.
    Comment: Submitted to ICASSP 202
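The distinction the MPPT variant draws is which frames contribute to the loss. A toy illustration (not the paper's code) of averaging per-frame losses over masked frames only versus over all frames:

```python
# Minimal sketch of the masked-only vs. masked-plus-unmasked loss choice.
# In the streaming variant described above, including unmasked frames gives
# the model a training signal at every position.

def mppt_loss(frame_losses, mask, include_unmasked=False):
    """Average per-frame prediction losses, restricted to masked frames by default.

    frame_losses -- precomputed per-frame prediction losses
    mask         -- 1 if the frame was masked at the input, else 0
    """
    if include_unmasked:
        selected = frame_losses                              # every frame
    else:
        selected = [l for l, m in zip(frame_losses, mask) if m == 1]
    return sum(selected) / len(selected)
```

For example, `mppt_loss([1.0, 3.0, 2.0], [1, 0, 1])` averages only the two masked frames, while passing `include_unmasked=True` averages all three.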

    Augmenting text for spoken language understanding with Large Language Models

    Spoken semantic parsing (SSP) involves generating machine-comprehensible parses from input speech. Training robust models for existing application domains represented in training data or extending to new domains requires corresponding triplets of speech-transcript-semantic parse data, which is expensive to obtain. In this paper, we address this challenge by examining methods that can use transcript-semantic parse data (unpaired text) without corresponding speech. First, when unpaired text is drawn from existing textual corpora, Joint Audio Text (JAT) and Text-to-Speech (TTS) are compared as ways to generate speech representations for unpaired text. Experiments on the STOP dataset show that unpaired text from existing and new domains improves performance by 2% and 30% in absolute Exact Match (EM) respectively. Second, we consider the setting when unpaired text is not available in existing textual corpora. We propose to prompt Large Language Models (LLMs) to generate unpaired text for existing and new domains. Experiments show that examples and words that co-occur with intents can be used to generate unpaired text with Llama 2.0. Using the generated text with JAT and TTS for spoken semantic parsing improves EM on STOP by 1.4% and 2.6% absolute for existing and new domains respectively.
    Comment: Submitted to ICASSP 202

    Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

    Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training that must be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, resulting in sparse monolingual models or a sparse multilingual model (named Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.
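The building block behind such pruning methods is a binary mask over the weights, typically chosen by magnitude. A minimal plain-Python sketch (illustrative only; the adaptive approach above would recompute this mask during training rather than fixing it after one pruning round):

```python
# Toy magnitude-based pruning mask: keep the largest-magnitude weights,
# zero out the rest. Names and the flat-list representation are illustrative.

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that prunes the smallest-magnitude weights.

    weights  -- flat list of weight values
    sparsity -- fraction of weights to prune, in [0, 1)
    """
    k = int(len(weights) * sparsity)                 # number of weights to drop
    threshold = sorted(abs(w) for w in weights)[k]   # k-th smallest magnitude
    return [1 if abs(w) >= threshold else 0 for w in weights]
```

For example, `magnitude_mask([0.1, -0.5, 0.3, 0.05], 0.5)` keeps the two largest-magnitude weights and prunes the rest; adapting the mask over time, as described above, avoids committing to one fixed sub-network.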

    Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

    Large-scale generative models such as GPT and DALL-E have revolutionized natural language processing and computer vision research. These models not only generate high-fidelity text or image outputs, but are also generalists that can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are neither filtered nor enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono- or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681), while being up to 20 times faster. See voicebox.metademolab.com for a demo of the model.

    The Effect of Providing Information Through Booklet Media on the Adherence Level of Type 2 DM Patients

    Adherence is a major component of successful diabetes treatment and is influenced by knowledge and skills regarding disease management. Providing information through health education with a multimedia approach, for example booklets, can help patients master information more effectively. This study aimed to determine the effect of providing information through booklet media on the adherence level of type 2 DM patients. The study used a pre-experimental method with a one-group pre-test-post-test design and included 36 samples selected by purposive sampling. Data were collected using questionnaires, and data analysis consisted of univariate and bivariate analyses. At pre-test, 29 respondents (80.6%) were classified as less adherent; at post-test, 34 respondents (94.4%) were adherent. The Wilcoxon signed-rank test yielded Z = 4.949 > Z-table = 1.96 with p-value = 0.001 < α = 0.05, from which it can be concluded that providing information through booklet media has a significant effect on the adherence level of type 2 DM patients. It is recommended that the hospital use booklet media when informing type 2 DM patients about diabetes mellitus and its treatment therapy, so that the information conveyed is easier to understand.
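The Z statistic reported above comes from the Wilcoxon signed-rank test's large-sample normal approximation. A minimal sketch of that approximation (illustrative, not the study's analysis code; the study's exact Z will also reflect how ties and zero differences were handled):

```python
import math

# Normal approximation for the Wilcoxon signed-rank statistic W with n pairs:
# under H0, E[W] = n(n+1)/4 and Var[W] = n(n+1)(2n+1)/24. The resulting z is
# compared against the 1.96 critical value used in the study.

def wilcoxon_z(w, n):
    """z statistic for signed-rank sum w over n non-zero paired differences."""
    mean = n * (n + 1) / 4                            # expected rank sum under H0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)    # its standard deviation
    return (w - mean) / sd
```

With n = 36, a rank sum far from the null expectation of 333 produces |z| well above 1.96, matching the study's conclusion of a significant pre/post difference.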

    Novel Hepatitis B Virus Capsid Assembly Modulator Induces Potent Antiviral Responses In Vitro and in Humanized Mice

    Hepatitis B virus (HBV) affects an estimated 250 million chronic carriers worldwide. Though several vaccines exist, they are ineffective for those already infected. HBV persists due to the formation of covalently closed circular DNA (cccDNA), the viral minichromosome, in the nucleus of hepatocytes. Current nucleoside analogs and interferon therapies rarely clear cccDNA, requiring lifelong treatment. Our group identified GLP-26, a novel glyoxamide derivative that alters HBV nucleocapsid assembly and prevents viral DNA replication. GLP-26 exhibited single-digit nanomolar anti-HBV activity, inhibition of HBV e antigen (HBeAg) secretion, and reduced cccDNA amplification, in addition to showing a promising preclinical profile. Strikingly, long-term combination treatment with entecavir in a humanized mouse model induced a decrease in viral loads and viral antigens that was sustained for up to 12 weeks after treatment cessation.

    2′-Chloro,2′-fluoro Ribonucleotide Prodrugs with Potent Pan-genotypic Activity against Hepatitis C Virus Replication in Culture

    Pan-genotypic nucleoside HCV inhibitors display a high genetic barrier to drug resistance and are the preferred direct-acting agents to achieve complete sustained virologic response in humans. Herein, we report the discovery of a β-d-2′-Cl,2′-F-uridine phosphoramidate nucleotide, 16, as a nontoxic pan-genotypic anti-HCV agent. Phosphoramidate 16 in its 5′-triphosphate form specifically inhibited HCV NS5B polymerase with no marked inhibition of human polymerases or the cellular mitochondrial RNA polymerase. Studies on the intracellular half-life of phosphoramidate 16-TP in live cells demonstrated a favorable half-life of 11.6 h, suggesting once-a-day dosing. Stability in human blood and favorable metabolism in human intestinal and liver microsomes make phosphoramidate 16 a prospective candidate for further studies to establish its potential value as a new anti-HCV agent.